Abstract: In this paper, we present a novel learning-aided
energy management scheme (LEM) for multihop energy harvesting
networks. Different from prior works on this problem, our algorithm
explicitly incorporates information learning into system
control via a step called perturbed dual learning. LEM does not
require any statistical information of the system dynamics for 
implementation, and efficiently resolves the challenging energy 
outage problem. We show that LEM achieves the near-optimal
$[O(\epsilon), O(\log(1/\epsilon)^2)]$ utility-delay tradeoff with $O(1/\epsilon^{1-c/2})$
energy buffers ($c \in (0,1)$). More interestingly, LEM possesses a
convergence time of $O(1/\epsilon^{1-c/2} + 1/\epsilon^{c})$, which is much faster
than the $\Theta(1/\epsilon)$ time of pure queue-based techniques or the
$\Theta(1/\epsilon^2)$ time of approaches that rely purely on learning the
system statistics. This fast convergence property makes LEM 
more adaptive and efficient in resource allocation in dynamic 
environments. The design and analysis of LEM demonstrate how 
system control algorithms can be augmented by learning and 
what the benefits are. The methodology and algorithm can also 
be applied to similar problems, e.g., processing networks, where
nodes require a nonzero amount of content to support their
actions.


I. Introduction 

Recent developments in energy harvesting technologies 
make it possible for wireless devices to support their functions 
by harvesting energy from the environment. For example, by 
using solar panels Cl 12, by harvesting ambient radio power 
12, and by converting mechanical vibration into energy H, 
0. Due to the capability in providing long lasting energy 
supply, the energy harvesting technology has the potential to 
become a promising solution to energy problems in networks 
formed by self-powered devices, e.g., wireless sensor networks 
and mobile devices. 

To realize the full benefits of energy harvesting, algorithms 
must be designed to efficiently incorporate it into system 
control. In this paper, we develop an online learning-aided 
energy management scheme for energy harvesting networks. 
Specifically, we consider a discrete stochastic network, where 
network links have time-varying qualities, and nodes are 
powered by finite capacity energy storage devices and can 
harvest energy from the environment. In each time slot, every 
node decides how much new workload to admit, e.g., sampled 
data from a field, and how much power to spend for traffic 
transmission (or data processing). The objective of the network 
is to find a joint energy management and scheduling policy,
so as to maximize the aggregate traffic utility, while ensuring
network stability and energy availability, i.e., the network 
nodes always have enough energy to support transmission. 

There have been many previous works on energy harvesting 
networks. Works and 12 consider a leaky-bucket-like structure
and design joint energy prediction and power management
schemes for energy harvesting sensor nodes. 0 focuses on 
designing energy-efficient schemes that maximize the decay 
exponent of the queue size. 0 develops scheduling algorithms 
to achieve near-optimal utility for energy harvesting networks 
with time-varying channels. Cl designs an energy-aware 
routing scheme that achieves optimality as the network size 
increases. CD proposes an online energy management and 
scheduling algorithm for multihop energy harvesting networks. 
ifTSll considers joint compression and transmission in energy 
harvesting networks. 02 considers a multihop network and 
proposes a control scheme based on energy replenishment rate 
estimation. 

However, we notice that the aforementioned works either 
focus on scenarios where complete statistical information is 
given beforehand, or try to design schemes that do not require 
such information. Therefore, they ignore the potential benefits 
of utilizing information of system dynamics in control, and 
do not provide interfaces for integrating information collecting 
and learning techniques 02, e.g., sensing and data mining or 
machine learning, into algorithm design. In this work, we try 
to explicitly bring information learning into the system control 
framework. Specifically, we develop a learning mechanism 
called perturbed dual learning and propose a learning-aided 
energy management scheme (LEM). 

LEM is an online control algorithm and does not require any 
statistical information for implementation. Instead, it builds 
an empirical distribution of the system dynamics, including 
network condition variation and energy availability fluctuation. 
Then, it learns an approximate optimal Lagrange multiplier of 
a carefully constructed underlying optimization problem that 
captures system optimality, via a step called perturbed dual 
learning. Finally, LEM incorporates the learned information 
into the system controller by augmenting the controller with 
the approximate multiplier. We show that LEM is able to
achieve a near-optimal $[O(\epsilon), O(\log(1/\epsilon)^2)]$ utility-delay tradeoff
for general multihop energy harvesting networks with an
$O((1/\epsilon)^{2/3}\log(1/\epsilon)^2)$ energy storage capacity and resolves the
energy outage problem. Moreover, we show that by incorporating
information learning, one can significantly improve
the algorithm convergence time, i.e., the time an algorithm
takes to converge to its optimal operating point; LEM requires
an $O((1/\epsilon)^{2/3}\log(1/\epsilon)^2)$ time for convergence, whereas existing
queue-based algorithms require a $\Theta(1/\epsilon)$ time and algorithms
based purely on learning the statistics require a $\Theta(1/\epsilon^2)$ time.
This fast convergence implies that learning-aided algorithms
can adapt faster when the environment statistics change,
which indicates better robustness and higher efficiency in
resource allocation.

Learning-aided control with dual learning was first developed
in M- In this work, we extend the results to resolve
energy outage problems in energy harvesting networks via
a perturbed version of dual learning. Intuitively speaking,
perturbed dual learning learns a perturbed empirical optimal
Lagrange multiplier required for “no-underflow” systems,
where optimal multipliers must be steered and made trackable 
by queues. 

Our paper is mostly related to recent works la, ini, and 
HD. Specifically, both [6] and HD try to form estimations
of the harvestable energy rates and utilize the information in 
network control. However, they do not consider the system 
dynamics and do not explicitly characterize network delay 
performance. On the other hand, CD focuses on achieving 
long term performance guarantees without learning. Moreover, 
these three works do not characterize the algorithm convergence
speed, which is an important metric for measuring the
efficiency of control algorithms in learning the optimal system 
operating point in dynamic environments. 

We summarize the main contributions as follows: 

• We propose the Learning-aided Energy Management
algorithm (LEM) for multihop energy harvesting
networks, and show that LEM achieves a near-optimal
$[O(\epsilon), O(\log(1/\epsilon)^2)]$ utility-delay tradeoff with an
$O((1/\epsilon)^{2/3}\log(1/\epsilon)^2)$ energy storage capacity.

• We show that LEM possesses an $O((1/\epsilon)^{2/3}\log(1/\epsilon)^2)$
convergence time. This convergence time is much faster
compared to the $\Theta(1/\epsilon)$ time of existing queue-based
techniques and the $\Theta(1/\epsilon^2)$ time required for approaches
that purely rely on learning the statistics.

• We analyze the performance of LEM with the augmented
drift analysis approach, which handles the interplay between
learning and control and no-underflow constraints.
This analysis approach can likely find applications to
other similar problems with the no-underflow constraints,
e.g., processing networks M-

The rest of the paper is organized as follows. We present
the system model in Section II. We explain the algorithm
design approach and present the LEM algorithm in Section III
and explain the intuition. Then, we present the performance
results of LEM in Section IV. Simulation results are provided
in Section V. We conclude the paper in Section VI.


II. The System Model


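We consider a general multi-hop network that operates in
slotted time. The network is modeled by a directed graph
$\mathcal{G} = (\mathcal{N}, \mathcal{L})$, where $\mathcal{N} = \{1, 2, \ldots, N\}$ is the set of nodes
in the network, and $\mathcal{L} = \{[n, m],\ n, m \in \mathcal{N}\}$ is the set
of communication links. We use $\mathcal{N}_n^{(o)}$ to denote the set of
nodes $b$ with $[n, b] \in \mathcal{L}$, for each node $n$, and use $\mathcal{N}_n^{(in)}$
to denote the set of nodes $a$ with $[a, n] \in \mathcal{L}$. We define
$d_{\max} \triangleq \max_n (|\mathcal{N}_n^{(in)}|, |\mathcal{N}_n^{(o)}|)$ to be the maximum in-degree/out-degree
that any node $n$ can have.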

A. The Traffic and Utility Model

At every time slot, the network decides how much new
workload (called packets below) destined for node $c$ to admit
at node $n$. We call this traffic the commodity $c$ data and
use $R_n^{(c)}(t)$ to denote the amount of new commodity $c$ data
admitted. We assume that $0 \le R_n^{(c)}(t) \le R_{\max}$ for all $n, c$
with some finite $R_{\max} > 0$ at all time.

We assume that each commodity is associated with a utility
function $U_n^{(c)}(\bar{r}^{nc})$, where $\bar{r}^{nc}$ is the time average rate of
the commodity $c$ traffic admitted into node $n$, defined as
$\bar{r}^{nc} = \lim_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} E\{R_n^{(c)}(\tau)\}$. Each $U_n^{(c)}(r)$ function
is assumed to be increasing, continuously differentiable, and
concave in $r$ with a bounded first derivative and $U_n^{(c)}(0) = 0$.
We define $\beta \triangleq \max_{n,c} (U_n^{(c)})'(0)$ to be the maximum first derivative
of all utility functions.


B. The Transmission Model 

In order to deliver the admitted data to their destinations,
each node needs to allocate power to the links for transmission
at every time slot. To model the effect that the transmission
rates typically also depend on the link conditions and that
the link conditions may be time-varying, we denote by $S(t)$
the network channel state, i.e., the $N$-by-$N$ matrix where
the $(n, m)$ component of $S(t)$ denotes the channel condition
between nodes $n$ and $m$.

Denote by $P_{[n,b]}(t)$ the power allocated to link $[n, b]$ at time
$t$. At every time slot, if $S(t) = s_i$, the power allocation
vector $P(t) = (P_{[n,b]}(t), [n, b] \in \mathcal{L})$ must be chosen from
some feasible power allocation set $\mathcal{P}^{(s_i)}$. We assume that
$\mathcal{P}^{(s_i)}$ is compact for all $s_i$, and that every power vector
in $\mathcal{P}^{(s_i)}$ satisfies the constraint that for each node $n$,
$0 \le \sum_{b \in \mathcal{N}_n^{(o)}} P_{[n,b]}(t) \le P_{\max}$ for some finite $P_{\max} > 0$. We
also assume that for any $P \in \mathcal{P}^{(s_i)}$, setting the entry $P_{[n,b]}$
to zero yields another power vector that is still in $\mathcal{P}^{(s_i)}$.
Given channel state $S(t)$ and power allocation vector $P(t)$,
the transmission rate over link $[n, b]$ is given by the rate-power
function $\mu_{[n,b]}(t) = \mu_{[n,b]}(S(t), P(t))$.

For each $s_i$, we assume that the function $\mu_{[n,b]}(s_i, P)$
satisfies the following properties: Let $P, P' \in \mathcal{P}^{(s_i)}$ be such
that $P'$ is obtained by changing any single component $P_{[n,b]}$
in $P$ to zero. Then, (i) there exists some finite constant $k > 0$
such that:

$$\mu_{[n,b]}(s_i, P) \le \mu_{[n,b]}(s_i, P') + k P_{[n,b]}, \tag{1}$$

and (ii) for each link $[a, m] \ne [n, b]$,

$$\mu_{[a,m]}(s_i, P) \le \mu_{[a,m]}(s_i, P'). \tag{2}$$

These properties can be satisfied by most rate-power functions,
e.g., when the rate function is differentiable and has finite
directional derivatives with respect to power ini, and when
link rates do not improve with increased interference.

^1 In this paper, we assume for clarity that all limits exist with probability
1. When some limits do not exist, we can obtain similar results by replacing
limit by lim inf or lim sup, but the results are more involved.

We assume that there exists a finite constant $\mu_{\max}$ such
that $\mu_{[n,b]}(t) \le \mu_{\max}$ for all time under any power allocation
vector $P(t)$ and any channel state $S(t)$. We use $\mu_{[n,b]}^{(c)}(t)$ to
denote the rate allocated to the commodity $c$ data over link
$[n, b]$ at time $t$. It can be seen that $\sum_c \mu_{[n,b]}^{(c)}(t) \le \mu_{[n,b]}(t)$ for
all $[n, b]$ and for all $t$.


C. The Energy Harvesting Model 

Each node in the network is assumed to be powered by a
finite capacity energy storage device, e.g., a battery or an
ultra-capacitor ca. We model such a device with an energy queue.
We use the energy queue size at node $n$ at time $t$, denoted by
$E_n(t)$, to measure the amount of the energy stored at node $n$
at time $t$. Each node $n$ can observe its current energy level
$E_n(t)$. In any time slot $t$, the power allocation vector $P(t)$
must satisfy the following “energy-availability” constraint:

$$\sum_{b \in \mathcal{N}_n^{(o)}} P_{[n,b]}(t) \le E_n(t), \quad \forall\, n, \tag{3}$$

i.e., the consumed power must be no more than what is
available.

Each node in the network is assumed to be capable of 
harvesting energy from the environment, for instance, using 
solar panels m or mechanical vibration 13 . To capture the 
fact that the amount of harvestable energy typically varies over
time, we use $h_n(t)$ to denote the amount of harvestable energy
by node $n$ at time $t$, and denote by $h(t) = (h_1(t), \ldots, h_N(t))$
the harvestable energy vector at time $t$, called the energy state.
We assume that $h_n(t) \le h_{\max}$ for all $n, t$ for some finite $h_{\max}$.
In the following, it is convenient for us to assume that each
node can decide whether or not to harvest energy in each slot.
Specifically, we use $e_n(t) \in [0, h_n(t)]$ to denote the amount
of energy that is actually harvested at time $t$. We will see later
that under our algorithm, $e_n(t) \ne h_n(t)$ only when the energy
storage is close to full.

Denote $z(t) = (S(t), h(t))$. We assume that $z(t)$ takes
values in $\mathcal{Z} = \{z_1, \ldots, z_M\}$, where $z_m = (s_m, h_m)$, and is
i.i.d. every time. We denote $\pi_m = \Pr\{z(t) = z_m\}$. We also
rewrite $\mathcal{P}^{(s_i)}$ as $\mathcal{P}^{(m)}$ and $\mu_{[n,b]}(s_m, P) = \mu_{[n,b]}(z_m, P)$. This
allows arbitrary correlations among the harvestable energy
processes and channel dynamics.^3


D. Queueing Dynamics 

Let $Q(t) = (Q_n^{(c)}(t),\ n, c \in \mathcal{N})$, $t = 0, 1, 2, \ldots$ be the data
queue backlog vector in the network, where $Q_n^{(c)}(t)$ is the
amount of commodity $c$ data queued at node $n$. We assume
the following queueing dynamics:

$$Q_n^{(c)}(t+1) \le \Big[ Q_n^{(c)}(t) - \sum_{b \in \mathcal{N}_n^{(o)}} \mu_{[n,b]}^{(c)}(t) \Big]^+ + \sum_{a \in \mathcal{N}_n^{(in)}} \mu_{[a,n]}^{(c)}(t) + R_n^{(c)}(t), \tag{4}$$

with $Q_n^{(c)}(0) = 0$ for all $n, c \in \mathcal{N}$, $Q_c^{(c)}(t) = 0$ for all $t$, and
$[x]^+ = \max[x, 0]$. The inequality in (4) is due to the fact that
some nodes may not have enough commodity $c$ packets to fill
the allocated rates. In this paper, we say that the network is
stable if the following condition is met:

$$\limsup_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \sum_{n,c} E\{Q_n^{(c)}(\tau)\} < \infty. \tag{5}$$

^2 We measure time in unit size slots, so that our power $P_{[n,b]}(t)$ has units
of energy/slot, and $P_{[n,b]}(t) \times$ (1 slot) is the resulting energy consumption
in one slot. Also, the energy harvested at time $t$ is assumed to be available
for use in time $t + 1$.

^3 The i.i.d. assumption is made for ease of presentation. Our results can be
extended to the case when $z(t)$ evolves according to a general finite-state
Markov chain.

Similarly, let $E(t) = (E_n(t), n \in \mathcal{N})$ be the vector of
energy queue sizes. Due to the energy availability constraint
(3), we see that for each node $n$, the energy queue $E_n(t)$
evolves according to the following:

$$E_n(t+1) = E_n(t) - \sum_{b \in \mathcal{N}_n^{(o)}} P_{[n,b]}(t) + e_n(t), \tag{6}$$

with $E_n(0) = 0$ for all $n$. Note that with (6), we start by
assuming that each energy queue has infinite capacity. We will
show later that under our algorithm, a finite buffer size is
sufficient for achieving the desired performance.
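
As an illustration of the data and energy queue dynamics of this section, the following is a minimal Python sketch of one time slot. It is not part of the original algorithm description: the function names and the dictionary layout (nodes as integers, links as `(n, b)` pairs, per-commodity rates keyed by `(n, b, c)`) are assumptions made here for illustration, and the data queue update is taken with equality, i.e., every link is assumed to carry its full allocated rate.

```python
def energy_available(E, P):
    """Energy-availability check: at every node, total allocated power
    must not exceed the stored energy. E maps node -> energy level and
    P maps link (n, b) -> power; every transmitting node must be in E."""
    spent = {n: 0.0 for n in E}
    for (n, _b), power in P.items():
        spent[n] += power
    return all(spent[n] <= E[n] for n in E)


def update_energy(E, P, e):
    """Energy queue update: subtract the power spent on outgoing links
    and add the energy actually harvested (e maps node -> harvested)."""
    nxt = dict(E)
    for (n, _b), power in P.items():
        nxt[n] -= power
    for n, harvested in e.items():
        nxt[n] += harvested
    return nxt


def update_data(Q, mu, R):
    """Data queue update, taken with equality (hypothetical simplification).
    Q maps (node, commodity) -> backlog and must contain every (node, commodity)
    pair touched by mu; mu maps (n, b, c) -> rate; R maps (node, commodity) -> admitted."""
    served = {key: 0.0 for key in Q}
    arrived = {key: 0.0 for key in Q}
    for (n, b, c), rate in mu.items():
        served[(n, c)] += rate
        arrived[(b, c)] += rate
    nxt = {}
    for (n, c), backlog in Q.items():
        if n == c:
            nxt[(n, c)] = 0.0  # data that reached its destination leaves the network
        else:
            nxt[(n, c)] = (max(backlog - served[(n, c)], 0.0)
                           + arrived[(n, c)] + R.get((n, c), 0.0))
    return nxt
```

A controller would call `energy_available` before committing a power vector, then apply `update_data` and `update_energy` at the end of the slot.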


E. Utility Maximization 

The goal of the network is to design a joint flow control, 
routing and scheduling, and energy management algorithm to 
maximize the system utility, defined as: 

$$U_{tot}(\bar{r}) = \sum_{n,c} U_n^{(c)}(\bar{r}^{nc}), \tag{7}$$

subject to network stability (5) and energy availability (3).
Here $\bar{r} = (\bar{r}^{nc}, \forall\, n, c \in \mathcal{N})$ is the vector of the average
expected admitted rates. We also use $r^*$ to denote an optimal
rate vector that maximizes (7) subject to (5) and (3).


F. Discussion of the Model 

This model is general and can be used to model systems 
that are self-powered and can harvest energy, e.g., environment 
monitoring wireless sensor networks, or networks formed by 
mobile cellular devices. The same model was also considered 
in im. There, two online algorithms were developed for 
achieving near-optimal utility performance. In this work, we 
use a very different approach, which explicitly incorporates 
learning into algorithm design and explores the benefits of 
historic system information. Moreover, while previous works 
mostly focus on long term average performance, we also 
investigate the algorithm convergence time, defined to be the 
time it takes for the algorithm (and the system) to learn the 
optimal operating point. 


III. Algorithm Design via Learning 

In this section, we present our algorithm and the design 
approach. To facilitate understanding, we first discuss the 
intuition behind our approach. Then, we provide detailed 
descriptions of the algorithm.

A. Design Approach


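We first consider the following optimization problem, which
can be intuitively viewed as the solution to our problem.

$$\max: \quad f = V \sum_{n,c} U_n^{(c)}(r^{nc}) \tag{8}$$

$$\text{s.t.} \quad r^{nc} + \sum_m \pi_m \sum_{a \in \mathcal{N}_n^{(in)}} \mu_{[a,n]}^{(c)m} \le \sum_m \pi_m \sum_{b \in \mathcal{N}_n^{(o)}} \mu_{[n,b]}^{(c)m}, \quad \forall\, (n, c) \tag{9}$$

$$\sum_m \pi_m \sum_{b \in \mathcal{N}_n^{(o)}} P_{[n,b]}^{m} = \sum_m \pi_m e_n^{m}, \quad \forall\, n \tag{10}$$

$$0 \le r^{nc} \le R_{\max},\ \forall\, (n, c), \quad P^m \in \mathcal{P}^{(m)}, \quad 0 \le e_n^m \le h_n^m,\ \forall\, n, m$$

$$\sum_c \mu_{[n,b]}^{(c)m} \le \mu_{[n,b]}(z_m, P^m), \quad \forall\, [n, b].$$

Here $V \ge 1$ is a constant and corresponds to a control parameter
of our algorithm (explained later). Intuitively, problem
(8) computes an optimal control policy. To see this, note that
we can interpret $r^{nc}$ as the traffic admission rate, $P^m$ as
the power allocation vector under state $z_m$, and $e^m$ as the
energy harvesting decision. (9) represents the queue stability
constraint and (10) denotes the energy consumption constraint.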

In practice, one may not always have the statistics $(\pi_m, \forall\, m)$
a priori. As a result, online algorithms have been proposed,
e.g., ESA in O, HU- However, doing so ignores the historic
system information one can accumulate over time and loses its
value. In our case, we try to explicitly utilize such information
and to explore its benefits. Specifically, we will try to build
an empirical distribution for the system dynamics $\{z_m\}_{m=1}^{M}$.
Then, we solve a perturbed empirical version of the dual
problem of (8) to obtain an empirical Lagrange multiplier
(we call this step perturbed dual learning). After that, we
incorporate the empirical multiplier into an online system
controller (Fig. 1 shows its steps).
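
The first step of this approach, building an empirical distribution of the observed states, can be sketched as follows. This is only an illustrative sketch: the function name and the representation of a state as any hashable Python value are assumptions made here, and solving the dual problem on top of the estimate is not shown.

```python
from collections import Counter


def empirical_distribution(observed_states, state_space):
    """Estimate the state distribution by counting how often each state
    appeared in the observation window, divided by the window length."""
    if not observed_states:
        raise ValueError("need at least one observed slot")
    counts = Counter(observed_states)  # missing states count as zero
    total = len(observed_states)
    return {z: counts[z] / total for z in state_space}
```

For example, observing the states `["G", "B", "G", "G"]` over four slots yields an estimate of 0.75 for `"G"` and 0.25 for `"B"`.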


B. Learning-aided Energy Management 

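Here we present our algorithm, which consists of an online
controller and a learning component. We first present the
algorithm and then explain the controller in Section III-D.
For our algorithm, we need the dual problem of (8):

$$\min: \quad g(\upsilon, \nu), \quad \text{s.t.} \quad \upsilon \succeq 0,\ \nu \in \mathbb{R}^N, \tag{11}$$

where $\upsilon = (\upsilon_n^c, \forall\, (n, c))$, $\nu = (\nu_n, \forall\, n)$ are the Lagrange
multipliers, and the dual function $g(\upsilon, \nu) = \sum_m \pi_m g_m(\upsilon, \nu)$,
where $g_m(\upsilon, \nu)$ is defined as:

$$g_m(\upsilon, \nu) = \sup \Big\{ V \sum_{n,c} U_n^{(c)}(r^{nc}) - \sum_{n,c} \upsilon_n^c \Big[ r^{nc} + \sum_{a \in \mathcal{N}_n^{(in)}} \mu_{[a,n]}^{(c)m} - \sum_{b \in \mathcal{N}_n^{(o)}} \mu_{[n,b]}^{(c)m} \Big] + \sum_n \nu_n \Big[ \sum_{b \in \mathcal{N}_n^{(o)}} P_{[n,b]}^{m} - e_n^m \Big] \Big\}. \tag{12}$$

Here the sup is taken over $r$, $P^m \in \mathcal{P}^{(m)}$, $\mu$, and $e$. In
the following, we use $(\upsilon^*, \nu^*)$ to represent an optimal solution
of $g(\upsilon, \nu)$.

^4 Technically speaking, one has to solve a “convexified” version of (8) to
find an optimal control policy. But (8) is sufficient for our algorithm design
and analysis.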

We now present the algorithm, which uses a control parameter
$V \ge 1$ to trade off utility and delay, and specifies a
learning time $T_L = V^c$ for some $c \in (0,1)$.


[Figure: block diagram of LEM with three blocks: distribution estimation (empirical distribution), perturbed dual learning, and the drift-based controller with the $(Q(t), E(t))$ update.]

Fig. 1. There are three main components in LEM: (i) Build an empirical
distribution $\pi(t)$ for $z(t)$. (ii) Perform perturbed dual learning and obtain
the empirical optimal multiplier at time $T_L$. (iii) Incorporate the multiplier
into the controller.


Learning-aided Energy Management (LEM): Initialize
$\xi_Q = 0$, $\xi_E = 0$, and set $T_L = V^c$ with $c \in (0,1)$. At every
time $t$, observe $Q(t)$, $E(t)$, $z(t)$, and define the following
augmented queue vectors:

$$\tilde{Q}(t) = Q(t) + \xi_Q, \quad \tilde{E}(t) = E(t) + \xi_E. \tag{13}$$

Then, do:

• Energy harvesting: If $\tilde{E}_n(t) - \theta_n < 0$, harvest energy,
i.e., set $e_n(t) = h_n(t)$. Else set $e_n(t) = 0$.

• Data admission: For each $n$, choose $R_n^{(c)}(t)$ by solving
the following optimization problem:

$$\max: \quad V U_n^{(c)}(r) - \tilde{Q}_n^{(c)}(t)\, r, \quad \text{s.t.} \quad 0 \le r \le R_{\max}. \tag{14}$$

• Power allocation: Define the weight of commodity c data 
over link [n, b] as: 


(15) 


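Then, define the link weight $W_{[n,b]}(t) = \max_c W_{[n,b]}^{(c)}(t)$,
and choose $P(t) \in \mathcal{P}^{(s_i)}$ to maximize:

$$G(P(t)) \triangleq \sum_n \Big[ \sum_{b \in \mathcal{N}_n^{(o)}} \mu_{[n,b]}(t) W_{[n,b]}(t) + (\tilde{E}_n(t) - \theta_n) \sum_{b \in \mathcal{N}_n^{(o)}} P_{[n,b]}(t) \Big]. \tag{16}$$

• Routing and scheduling: For every node $n$, find any
$c^* \in \arg\max_c W_{[n,b]}^{(c)}(t)$. If $W_{[n,b]}^{(c^*)}(t) > 0$, set:

$$\mu_{[n,b]}^{(c^*)}(t) = \mu_{[n,b]}(t), \quad \mu_{[n,b]}^{(c)}(t) = 0,\ \forall\, c \ne c^*. \tag{17}$$

That is, allocate the full rate over link $[n, b]$ to any
commodity that achieves the maximum positive weight
over the link. Use idle-fill if needed.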

• Queue update and packet dropping: Use Last-In-First-Out
(LIFO) for packet selection. If for any node $n$, the
resulting $\{P_{[n,m]}\}$ in (16) violates constraint (3), set
P[n,b]{t) ~ En{f) and drop all the packets that 

are supposed to be transmitted. Update $Q_n^{(c)}(t)$ and $E_n(t)$
according to (4) and (6), respectively.

• Perturbed Dual-learning at $T_L$: Let $N_m$ be the number
of times state $z_m$ appears in $\{0, \ldots, T_L - 1\}$. Denote by
$\pi_m(T_L) = N_m / T_L$ the empirical distribution of $z_m$. Solve:

$$\min: \quad \sum_m \pi_m(T_L)\, g_m(\upsilon, \nu - \theta), \quad \text{s.t.} \quad \upsilon \succeq 0,\ \nu \in \mathbb{R}^N, \tag{18}$$

and obtain the optimal multiplier $(\upsilon^*(T_L), \nu^*(T_L))$.
Change $\xi_Q$ and $\xi_E$ in (13) to:

$$\xi_Q = \upsilon^*(T_L) - V^{1-c/2}\log(V)^2 \cdot \mathbf{1}, \tag{19}$$

$$\xi_E = \nu^*(T_L) - V^{1-c/2}\log(V)^2 \cdot \mathbf{1}. \tag{20}$$
We will explain the controller in the next subsection. Here,
we first note that the perturbed dual learning step is performed
only once at time $t = T_L$. Also, although LEM is equipped
with a packet dropping option to ensure zero energy outage,
dropping rarely happens, i.e., $O(1/V^{\log(V)})$. Moreover, we will
show that the energy availability constraint is always ensured.
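
To make the energy harvesting and data admission rules above concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions rather than the paper's implementation: the function names are hypothetical, the augmented queue values and the thresholds are passed in as plain numbers, and the admission problem is solved by a simple grid search over the admission interval instead of a closed-form or convex solver.

```python
import math


def harvest_decisions(E_aug, theta, h):
    """Harvest all available energy at node n when its augmented energy
    queue is below the threshold theta[n]; otherwise harvest nothing."""
    return {n: (h[n] if E_aug[n] - theta[n] < 0 else 0.0) for n in E_aug}


def admit(V, Q_aug, utility, R_max, grid_points=201):
    """Pick r in [0, R_max] maximizing V * utility(r) - Q_aug * r
    by evaluating an evenly spaced grid (hypothetical solver choice)."""
    best_r, best_val = 0.0, V * utility(0.0)
    for i in range(grid_points):
        r = R_max * i / (grid_points - 1)
        val = V * utility(r) - Q_aug * r
        if val > best_val:
            best_r, best_val = r, val
    return best_r


def log_utility(r):
    """Example concave utility log(1 + r), as used in the simulation section."""
    return math.log(1.0 + r)
```

With a logarithmic utility, the admitted amount shrinks as the augmented backlog grows: an empty queue admits the maximum amount, while a large backlog admits nothing.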


C. Remarks on LEM 

LEM only requires knowledge of the instantaneous state
$z(t)$ and queue states $Q(t)$ and $E(t)$. It does not require
any statistical information about $S(t)$ or any knowledge of
the energy state process $h(t)$. This is a very useful feature,
as exact knowledge of the energy source may be difficult to
obtain at the beginning.

There is an explicit learning step in LEM. This distinguishes 
it from previous algorithms for energy harvesting networks, 
e.g., d, m, 0, where sufficient statistical knowledge 
of the energy source is often required and no learning is 
considered. We will show in Theorem 2 that LEM converges
in $O(V^{1-c/2} + V^{c})$ time (up to a log factor), which is much faster
than the $\Theta(V)$ time for algorithms based purely on queues, or
the $\Theta(V^2)$ time for algorithms based purely on learning the
statistics.

The perturbation approach here is needed for guaranteeing
the feasibility of dual learning and the resulting algorithm.
Specifically, it “shifts” the optimal Lagrange multiplier to a
positive value via $\theta$. This step allows us to track the negative
multiplier with positive queue sizes for decision making, and is
critical for networks with the “no-underflow” constraint, e.g.,
processing networks US).

Finally, note that due to the general rate functions
$\{\mu_{[n,m]}(z, P)\}$, our problem inevitably requires a centralized
controller for achieving optimality. Thus, LEM also requires
centralized implementation. In the special case when network
links do not interfere with each other and the dynamics are
all independent, nodes can estimate the local distributions
and pass the information to a leader node to compute
$(\upsilon^*(T_L), \nu^*(T_L))$. Then, the leader node sends back the
multiplier information to the nodes. After that, LEM can be
implemented in a distributed manner.


D. Information Augmented Controller 

Here we provide mathematical explanations for our controller.
As we will see, the control rules are results of a drift
minimization principle [20], augmented by the information
learned in perturbed dual learning.

^5 One can also devise a version of LEM which does continuous learning.


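To start, we define a perturbed Lyapunov function as
follows:

$$L(t) \triangleq \frac{1}{2} \sum_{n,c} \big[\tilde{Q}_n^{(c)}(t)\big]^2 + \frac{1}{2} \sum_{n} \big[\tilde{E}_n(t) - \theta_n\big]^2. \tag{21}$$

Denote $Y(t) = (Q(t), E(t))$ and define a one-slot conditional
Lyapunov drift as follows:

$$\Delta(t) \triangleq E\{L(t+1) - L(t) \mid Y(t)\}. \tag{22}$$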
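
As a small numerical illustration, the sketch below evaluates a perturbed quadratic Lyapunov function of the kind used here, assuming the form in which data queues enter squared and energy queues enter as squared deviations from the thresholds $\theta_n$. The function name and the dictionary-based inputs are hypothetical, and the expectation in the drift is not computed; only a sample-path difference between two slots is shown.

```python
def perturbed_lyapunov(Q_aug, E_aug, theta):
    """Half the sum of squared augmented data queues plus half the sum of
    squared deviations of augmented energy queues from their thresholds."""
    data_term = sum(q * q for q in Q_aug.values())
    energy_term = sum((E_aug[n] - theta[n]) ** 2 for n in E_aug)
    return 0.5 * data_term + 0.5 * energy_term


def sample_path_change(before, after, theta):
    """Difference of the Lyapunov function between two consecutive slots;
    `before` and `after` are (Q_aug, E_aug) pairs. This is one realization,
    not the conditional expectation that defines the drift."""
    return (perturbed_lyapunov(after[0], after[1], theta)
            - perturbed_lyapunov(before[0], before[1], theta))
```

The energy term is smallest when each augmented energy queue sits at its threshold, which matches the intuition that the controller steers energy levels toward $\theta_n$.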

We then have the following lemma from El. 


Lemma 1: Under any feasible data admission action, power
allocation action that satisfies constraint (3), routing and
scheduling action, and energy harvesting action that can be
implemented at time $t$, we have:

A(f) - I >^(0} (23) 

n,c 

< H + E iEnit) - 0n)E{e„(f) I Y{t)} 

I Y{t)} 


-e{E 

n 




+ {En{t) — 0n) E Rln,b]{t) 


Y{t)}. 


Here B A -p f and 

$d_{\max}$ is defined as the maximum in-degree/out-degree of any
node in the network. ◇


Proof: See El . ■ 

Now add to both sides of (23) the following drift-augmenting
term, which carries the information learned in the
dual learning step, i.e., $\xi_Q$ and $\xi_E$:

A^(f)A-E{Ee§J E (24) 

beA/f"^ 

-E{EePn[ E P[n,b]it) - ^n{t)] \ Y(f)}. 

" beAff’ 

Doing so, one obtains the following augmented drift: 

A(f) + A^(f) - UE{ E uR^RP^t)) I Yit)} (25) 

n,c 

< P + E - ^n)Hen{t) I r(f)} 

nGAf 


-E{Y,[vui^HRi^\t))-Qi^\t)RRHt)] I Y{t)} 


-e{E E E 

"■ ^ beArf'> 

+ (P„(f)-0„) E 1^(0}- 

beAft^ 

Comparing (25) and LEM, we see that LEM is constructed to
minimize the right-hand-side (RHS) of the augmented drift
(25). This augmenting step is important and provides a way
to incorporate learning into control algorithm design.

IV. Performance Analysis

Here we present the performance results for LEM. We first 
state the assumptions. Then, we present the theorems. 

A. Assumptions 

In our analysis, we make the following assumptions. 

Assumption 1: There exists a constant e = 0(1) > 0 such 
that for any valid distributions ff = (tti, ..., ttm) with ||ff — 
7r|| < e, there exist a set of actions {Rl^^^}keN+, {-Pr}^N+’ 
{f^T}TeN+’ and and distributions and 

(possibly dependent on if), such that (i) there exists 
i?o = 0(1) > 0 independent of tt, so that: 

E (26) 

- E ^ -^ 0 , Vn,c, 

bGA/l°* 

and for each n, 

E = 0 , (27) 

m k m k 

and (ii) 0<J2m'^m Efe < Em Vn. O 

Although Assumption 1 appears complicated, it indeed
only assumes that the system has a “slackness” property, so
that there exists a stationary and randomized policy that can
stabilize the system, and the resulting service rates are slightly
larger than the arrival rates for the queues. Assumption 1 is
a necessary condition for achieving network stability and is
often assumed in network optimization works with $\epsilon = 0$, e.g.,
ED. Here with $\epsilon > 0$, we assume that systems with slightly
different channel and harvestable energy distributions can also
be stabilized with the same slack (the stabilizing policy may
be different).

B. Performance Results 

Here we present the performance results. We first define 
the following structural property of the system, which will be 
used in our analysis. 

Definition 1: A system is called polyhedral with parameter
$\rho > 0$, if the dual function $g(\upsilon, \nu)$ satisfies:

$$g(\upsilon^*, \nu^*) \le g(\upsilon, \nu) - \rho\, \|(\upsilon^*, \nu^*) - (\upsilon, \nu)\|. \tag{28}$$
◇

This polyhedral property often appears in practical systems,
especially when the control actions are discrete (see [22]
for more discussions). Moreover, (28) holds for all $V$ values
whenever it holds for $V = 1$.

Our first lemma shows that with $\theta = V\log(V)$, one can
guarantee that at time $T_L$, the empirical multipliers $\upsilon^*(t)$ and
$\nu^*(t)$ are close to their true values with high probability.

Lemma 2: For a sufficiently large $V$, with probability
$1 - O(1/V^{4\log(V)})$, at time $t = T_L = V^c$ with $c \in (0,1)$, one has:

$$\|(\upsilon^*(t), \nu^*(t)) - (\upsilon^*, \nu^* + \theta)\| = \Theta(V^{1-c/2}\log(V)). \tag{29}$$

Here $\nu^* + \theta = \Theta(V\log(V)) > 0$. ◇

Proof: See Appendix A. ■

Since $\nu^* + \theta = \Theta(V\log(V))$, we see that the relative error of
$(\upsilon^*(t), \nu^*(t))$ is quite small. This high accuracy (with respect
to the size of $(\upsilon^*, \nu^* + \theta)$) contributes to achieving a good
performance and fast convergence rate for LEM. Here $\nu^* + \theta =
\Theta(V\log(V)) > 0$ is important, because without $\theta$, we may
get a non-positive $\nu^*$ after solving (18), due to the fact that
(10) is an equality constraint. In that case, it is impossible to
use $E(t)$ to track $\nu^*$ and to base decisions on $E(t)$.

We now state our first main theorem, which summarizes the 
performance of LEM. 

Theorem 1: Suppose that the dual function $g(\upsilon, \nu)$ is polyhedral
with $\rho = \Theta(1) > 0$, i.e., independent of $V$, and has a
unique optimal $(\upsilon^*, \nu^*)$ with $\upsilon^* \succ 0$. Then, under LEM with
$\theta_n = V\log(V)$ and a sufficiently large $V$, with probability
$1 - O(1/V^{4\log(V)})$, we have:


(a) 


The average queue sizes satisfy: 
3, 


Qil < log(l^)" + 0(1), V (n, c), (30) 


En,m <^V 


■5log(V)2 + 0(l), Vn. 


(31) 


In particular, in steady state, there exist 0(1) constants 
D, K such that: 

^ogiVf + D + b}< (32) 


> -l/'-S log(V)" + D + b}< ee-^^ (33) 

(b) For every data queue $j$ with arrival rate $\lambda_j > 0$, there exists
a set of packets with rate $\tilde{\lambda}_j \ge [\lambda_j - O(1/V^{\log(V)})]^+$,
such that their average delay at queue $j$ is $O(\log(V)^2)$.

(c) Let $\bar{r} = (\bar{r}^{nc}, \forall\, (n, c))$ be the time average admitted
rate vector achieved by LEM defined in Section II-A. We
have:

$$U_{tot}(\bar{r}) \ge U_{tot}(r^*) - O(1/V). \tag{34}$$

Here $r^*$ is an optimal solution of our problem. Moreover,
no dropping takes place before time $T_L$ and the
average packet dropping rate is $O(1/V^{\log(V)})$. ◇


Proof: See Appendix B. ■ 

By taking $\epsilon = 1/V$, we see from Part (a) and Part (b) that
LEM achieves an $[O(\epsilon), O(\log(1/\epsilon)^2)]$ utility-delay tradeoff. We
also see from Part (a) that LEM can use an energy buffer of size
$O((1/\epsilon)^{1-c/2}\log(1/\epsilon)^2)$, which is much smaller than the $\Theta(1/\epsilon)$
size under previous algorithms.

Our second main result concerns the convergence time of 
LEM. The convergence time of an algorithm characterizes 
how fast it (or equivalently, the system) enters its steady 
state. A faster convergence speed implies faster learning and 
more efficient resource allocation. The formal definition of the 
convergence time is as follows ESI: 

Definition 2: Let $\zeta > 0$ be a given constant. The convergence
time of the control algorithm is defined as:

$$T_\zeta \triangleq \inf\{t : \|(\tilde{Q}(t), \tilde{E}(t)) - (\upsilon^*, \nu^* + \theta)\| \le \zeta\}. \tag{35}$$
◇

Here the intuition is that once $(\tilde{Q}(t), \tilde{E}(t))$ gets close to
$(\upsilon^*, \nu^* + \theta)$, LEM will start making near-optimal decisions.


Theorem 2: Suppose the conditions in Theorem 1 hold.
Under LEM, there exists an 0(1) constant D such that, with 
probability 1 - 0{yr^^), 

$$E\{T_D\} = O(V^{c} + V^{1-c/2}\log(V)^2). \tag{36}$$

In particular, when $c = 2/3$, $E\{T_D\} = O(V^{2/3}\log(V)^2)$. ◇

Proof: See Appendix C. ■

We remark here that if one only uses pure queue-based policies 
to track the optimal multipliers, e.g., ESA in (can be 
viewed as linear learning since each queue can change by an 
$O(1)$ amount at each time), the convergence time is necessarily
$\Theta(V)$, since the optimal multiplier is $\Theta(V)$ [22]. If instead
one tries to compute the optimal solution only by learning
the distribution, it requires $\Theta(V^2)$ time to ensure that the
distribution is within $O(1/V)$ accuracy. Dual learning can be
viewed as combining the benefits of the two methods, i.e., 
the fast start of statistical learning and the smooth learning of 
queue-based policies. Hence, it is able to achieve a superior 
convergence speed compared to both methods. 
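
The comparison of scalings above can be explored numerically with the short Python sketch below. It is only an illustration of the stated orders of growth, with all hidden constants set to one (an assumption made here); the function names are hypothetical. It also shows why $c = 2/3$ is the natural choice: it balances the two polynomial exponents $c$ and $1 - c/2$.

```python
import math


def lem_time_scale(V, c):
    """Order of the LEM convergence time, V^c + V^(1 - c/2) * log(V)^2,
    with all constants taken to be one."""
    return V ** c + V ** (1 - c / 2) * math.log(V) ** 2


def dominant_exponent(c):
    """Largest polynomial exponent of V in the bound, ignoring log factors."""
    return max(c, 1 - c / 2)


def best_learning_exponent(grid_size=1000):
    """Search c in (0, 1) on a grid for the smallest dominant exponent."""
    candidates = [i / grid_size for i in range(1, grid_size)]
    return min(candidates, key=dominant_exponent)
```

Note that the logarithmic factor is not negligible for moderate values of $V$, so the polynomial advantage over the linear and quadratic scalings becomes visible only as $V$ grows.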



V. Simulation 

This section provides simulation results for LEM. We consider
the network shown in Fig. 2, which is an example of
a data collecting sensor network. In this network, traffic is
admitted from nodes 1, 2 and 3, and is relayed to node 4.
Since we only have one commodity, we omit the superscript.


Fig. 2. A data collection network.

The channel state of each communication link is i.i.d. in every
time slot and can be either “G=Good” or “B=Bad.” The
probabilities of being in the good state for the links are given
by $p^G = (p^G_{12}, p^G_{13}, p^G_{23}, p^G_{24}, p^G_{34}) = (0.5, 0.2, 0.3, 0.5, 0.7)$. For
each node, the harvestable energy $h_n(t)$ takes values 2 or 0.
The probabilities of having a 2-unit energy arrival at the nodes
are $p^h = (p^h_1, p^h_2, p^h_3) = (0.6, 0.3, 0.5)$. Note that we have a
total of 32 channel states and 8 energy states.

At every time $t$, a node can either allocate one unit of power
for transmission or not transmit. When the channel is good,
one unit of power can support a transmission of two packets.
Otherwise, it can only support one. We assume $R_{\max} = 2$ and
$R_n(t) \in \{0, 1, 2\}$ at each time. The utility functions are given
by $U_1(r) = 3\log(1 + r)$, $U_2(r) = 2\log(1 + r)$, and
$U_3(r) = \log(1 + r)$. We also assume that the links do not
interfere with each other.

We simulate LEM with $V \in \{30, 40, 50, 80, 100, 150\}$ and
$c = 2/3$. We choose to begin with $V = 30$ so that dropping
does not happen. Each simulation is run for $10^6$ slots. In the
simulation, in order to combat the effect of $V$ not being
large enough, we slightly increase the learning time from
$V^c$ to $V^c\log(V)$ (the same performance can be proven). We also
reduce $V^{1-c/2}\log(V)^2$ in (19) and its companion equation to $\log(V)$;
the results are not affected. For benchmark comparison, we
also implement the ESA algorithm.

Fig. 3 shows the utility and queue performance of LEM. We
see that LEM achieves good utility performance. Here the
improvements for different $V$ values under LEM are small because, at
$V = 30$, the utility performance is already close to optimal.
We also see from the middle and the right plots that both
the average data queues and the average energy queues under
LEM are of size $O(\log(V)^2)$, whereas they are $\Theta(V)$ under
ESA. This implies that one can implement LEM with much
smaller energy buffers compared to ESA.


Other algorithms in the literature are designed for different settings and
do not directly apply to our problem.
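
The stochastic model of the simulation setup can be sketched as follows. This is only an illustration of the state space and the per-slot randomness under the stated probabilities; the names `P_GOOD`, `P_ENERGY`, `enumerate_states`, `sample_slot`, and `packets_per_unit_power` are hypothetical, and the sketch does not implement LEM or ESA.

```python
import itertools
import random

# Good-state probability per link and 2-unit energy arrival probability per node,
# taken from the simulation description.
P_GOOD = {(1, 2): 0.5, (1, 3): 0.2, (2, 3): 0.3, (2, 4): 0.5, (3, 4): 0.7}
P_ENERGY = {1: 0.6, 2: 0.3, 3: 0.5}

def enumerate_states():
    """Return all joint channel states and all joint energy states."""
    channel_states = list(itertools.product("GB", repeat=len(P_GOOD)))
    energy_states = list(itertools.product((0, 2), repeat=len(P_ENERGY)))
    return channel_states, energy_states

def sample_slot(rng):
    """Draw one slot: a channel state per link and harvestable energy per node."""
    channels = {link: ("G" if rng.random() < p else "B") for link, p in P_GOOD.items()}
    energy = {node: (2 if rng.random() < p else 0) for node, p in P_ENERGY.items()}
    return channels, energy

def packets_per_unit_power(channel):
    """One unit of power carries two packets on a good link and one on a bad link."""
    return 2 if channel == "G" else 1

channel_states, energy_states = enumerate_states()
print(len(channel_states), len(energy_states))  # 32 8
```

Five two-state links give $2^5 = 32$ joint channel states and three two-valued energy sources give $2^3 = 8$ joint energy states, matching the counts quoted in the setup.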



Fig. 4. Delay scaling under LEM. 

Fig. 4 also shows the average packet delay under LEM,
computed using the packets that have exited the network when the
simulation ends (these account for more than 99.9% of the
packets that enter the network). It can be seen that the average
packet delay under LEM grows very slowly, i.e., as $O(\log(V)^2)$,
and stays around 12 slots. On the other hand, the delay
under ESA grows linearly in $V$, and ranges from 75 to 380
slots (ESA requires $V > 30$ to achieve a utility
performance similar to that of LEM). Thus, with the same utility performance,
LEM achieves a 6 to 30-fold delay saving.
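
As a quick check of the quoted delay saving, the ratios can be computed directly from the approximate delays stated above. The variable names are illustrative only.

```python
# Approximate average delays (in slots) quoted in the text.
lem_delay = 12
esa_delay_low, esa_delay_high = 75, 380

saving_low = esa_delay_low / lem_delay     # 6.25
saving_high = esa_delay_high / lem_delay   # about 31.7

print(f"{saving_low:.2f}x to {saving_high:.2f}x")
```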



[Figure 5 plot: energy queue sizes versus time (slots 0 to 10000) under LEM and ESA, with annotations marking where LEM converges, where the system statistics change, where LEM re-converges, and the actual energy queues.]


Fig. 5. Convergence comparison between LEM and ESA for V = 150 

Fig. 5 shows the convergence property of LEM. To show the
convergence, we shorten the time to $10^4$ slots and change the
system statistics in slot 5000 to $p^G = (0.3, 0.2, 0.2, 0.5, 0.7)$
and $p^h = (0.1, 0.6, 0.2)$. We see that LEM converges much
faster than ESA (we only show $E(t)$ here; the data
queues have similar performance). Indeed, in the initial phase, the
energy queue sizes converge to the optimal values (the
corresponding optimal Lagrange multiplier values) at around 650
slots under LEM, whereas ESA converges at around 1500 slots
(an 850-slot saving). Then, after we apply the change, LEM
re-converges after about 450 slots, whereas ESA takes about
3300 slots to re-adapt to the system (about 7 times faster, saving about
2900 slots). Moreover, we also note that the actual energy
queue sizes are barely affected, except for a small change after
time 5000. This clearly demonstrates the effectiveness of
dual learning in accelerating the convergence of the algorithm.
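
The savings quoted above follow from simple arithmetic on the approximate convergence times read from Fig. 5. The variable names below are illustrative only.

```python
# Approximate convergence times (in slots) quoted in the text.
lem_initial, esa_initial = 650, 1500
lem_after_change, esa_after_change = 450, 3300

initial_saving = esa_initial - lem_initial               # 850 slots
readapt_saving = esa_after_change - lem_after_change     # 2850 slots, "about 2900" in the text
readapt_speedup = esa_after_change / lem_after_change    # roughly 7.3 times faster

print(initial_saving, readapt_saving, round(readapt_speedup, 1))
```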

VI. Conclusion 

In this paper, we develop a learning-aided energy management
algorithm (LEM) for general multihop energy harvesting
networks. LEM explicitly utilizes historic system information
and learns an empirical optimal Lagrange multiplier via
perturbed dual learning. Then, it incorporates the multiplier into
a drift-based system controller via drift-augmenting. We show
that LEM is able to achieve a near-optimal utility-delay tradeoff
with a finite energy storage capacity. Moreover, LEM possesses
a provably faster convergence speed compared to existing
techniques that are based purely on queue-based control or
purely on learning the statistics.
Appendix A - Proof of Lemma 2


Here we prove Lemma 2.

Proof: (Lemma 2) First of all, the result $\nu^* + \theta =
\Theta(V\log(V)) > 0$ follows from Lemma 1 in the cited work, which states
that $\nu^* = O(V)$.

Now we see that $\pi_m(t)$ will converge to $\pi_m$ with probability
1 [23]. Therefore, there exists an $O(1)$ time $T_\epsilon$ such that
$\|\pi(t) - \pi\| \le \epsilon$ for all $t \ge T_\epsilon$ with probability 1.

Then, at every time $t \ge T_\epsilon$, we use Theorem 1 in [24] to
obtain that:

$$g(v^*(t), \nu^*(t) - \theta, t) = \phi^*(t) \le V U_{\max}, \tag{37}$$

where $\phi^*(t)$ is the optimal value of the convexified version
of the problem with the empirical distributions [24], the function
$g(v, \nu - \theta, t) = \sum_m \pi_m(t) g_m(v, \nu - \theta)$ is introduced
for convenience, and $U_{\max} = U_{tot}(R_{\max} \cdot \mathbf{1})$.

Consider Assumption 1 and define:


f}i = mm 


^rn / , Qk ^n,kt / , ^m( / , Qk ^n,k ) 


Then, we see that for any subset $\mathcal{I} \subset \mathcal{N}$, we can construct a
policy that ensures (26), and ensures (27) with:

= pi,Vn e I, 


m ^ k k 


= -? 7 i,Vn ^ I. 


m ^ k k 

Denote the set of n’s with > 0. Using the 

definition of $g(v, \nu, t)$, and using the above policy with $\mathcal{I} =
\mathcal{N}/\mathcal{N}^+$, we have that for all time $t \ge T_\epsilon$, with probability 1,

71,C 

+m E VUt) - ^n) - I?1 E Vn(i) - 
<giv*{t),u*{t)-e,t)<VU,n... (38) 
(38) holds since $g(v^*(t), \nu^*(t) - \theta, t)$ achieves the supremum
over all actions. From (38), we have that:



$$\sum_{n,c} v_n^{(c)*}(t) \le q_d = \frac{V U_{\max}}{\eta_0}, \tag{39}$$

$$\sum_{n} |\nu_n^*(t) - \theta_n| \le q_e = \frac{V U_{\max}}{\eta_1}. \tag{40}$$

Denote 7 ^ = 

^maxMmax -^max 

^ ZD 

Te — max 

“1“ ^max? 


which are the maximum input rates into any data queue
or energy queue at any time. With (39) and (40), and the
definition of $g(v, \nu)$, we see that with probability 1, for all
$t \ge T_\epsilon$,

$$|g_m(v^*(t), \nu^*(t) - \theta)| \le V U_{\max} + q_d\gamma_q + q_e\gamma_e.$$

Therefore, 

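$$\begin{aligned}
&|g(v^*(t), \nu^*(t) - \theta, t) - g(v^*(t), \nu^*(t) - \theta)| \\
&\quad \le \sum_m |\pi_m(t) - \pi_m| \, |g_m(v^*(t), \nu^*(t) - \theta)| \\
&\quad \le \sum_m |\pi_m(t) - \pi_m| \, (V U_{\max} + q_d\gamma_q + q_e\gamma_e) \\
&\quad \le \max_m |\pi_m(t) - \pi_m| \times M (V U_{\max} + q_d\gamma_q + q_e\gamma_e).
\end{aligned} \tag{41}$$

Denote $\delta_m(t) = |\pi_m(t) - \pi_m|$ and $\delta_{tot} =
\max_m \delta_m(t) M (V U_{\max} + q_d\gamma_q + q_e\gamma_e)$. Using (41), we
see then: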

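$$g(v^*(t), \nu^*(t) - \theta) - g(v^*, \nu^*) \le 2\delta_{tot}, \tag{42}$$

for otherwise we can use (41) to get:

$$\begin{aligned}
&g(v^*, \nu^*, t) - g(v^*(t), \nu^*(t) - \theta, t) \\
&\quad \le [g(v^*, \nu^*) + \delta_{tot}] - [g(v^*(t), \nu^*(t) - \theta) - \delta_{tot}] \\
&\quad < 0.
\end{aligned}$$

This contradicts the fact that $(v^*(t), \nu^*(t))$ is the optimal
solution of $g(v, \nu - \theta, t)$.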

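Having established (42), we can now use the fact that
$g(v, \nu)$ is a polyhedral function, i.e., $g(v^*, \nu^*) \le g(v, \nu) -
\rho\|(v^*, \nu^*) - (v, \nu)\|$, to obtain:

$$\begin{aligned}
\|(v^*(t), \nu^*(t) - \theta) - (v^*, \nu^*)\|
&\le [g(v^*(t), \nu^*(t) - \theta) - g(v^*, \nu^*)]/\rho \\
&\le 2\max_m \delta_m(t) M (V U_{\max} + q_d\gamma_q + q_e\gamma_e)/\rho.
\end{aligned}$$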

Finally, we can use a similar argument as in Appendix H 
of Ha to show that, when V is large enough such that 
$4\log(V)V^{-c/2} \le 1$, at time $T_L = V^c$,

Pr{ maxSm(t) > —. (43) 

Therefore, with probability at least 1 —, we have 
max<5m(f) < Thus, 

$$\|(v^*(t), \nu^*(t)) - (v^*, \nu^* + \theta)\| \le \frac{8\log(V) M (V U_{\max} + q_d\gamma_q + q_e\gamma_e)}{\rho V^{c/2}} \tag{44}$$

$$= O(V^{1-c/2}\log(V)). \tag{45}$$

This proves the lemma. ■ 

Appendix B - Proof of Theorem 1

Here we prove Theorem 1. We will use the following lemma
and theorems in our analysis.


Lemma 3: ifTSi Let Q{t) be the size of a single queue with 
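dynamics $Q(t+1) = [Q(t) - \mu(t)]^+ + A(t)$. Suppose $0 \le
\mu(t), A(t) \le \delta_{\max} = \Theta(1)$ for all $t$ and that the queue is
stable. Then,

$$\overline{\mu(t) - A(t)} \le \delta_{\max} \overline{\Pr\{Q(t) < \delta_{\max}\}}. \tag{46}$$

Here $\overline{x(t)} = \lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} E\{x(t)\}$. $\Diamond$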
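
The queue dynamics in Lemma 3 and the inequality (46) can be checked numerically on a single sample path. The sketch below is an illustration under assumed inputs: the uniform service and arrival distributions, `delta_max = 2`, and the function name `simulate_queue` are all hypothetical choices, not taken from the paper.

```python
import random

def simulate_queue(num_slots, delta_max=2, seed=0):
    """Simulate Q(t+1) = max(Q(t) - mu(t), 0) + A(t), starting from Q(0) = 0.

    Returns the time average of mu(t) - A(t) and the fraction of slots
    in which Q(t) < delta_max.
    """
    rng = random.Random(seed)
    q = 0
    slack_sum = 0      # running sum of mu(t) - A(t)
    small_count = 0    # number of slots with Q(t) < delta_max
    for _ in range(num_slots):
        mu = rng.randint(0, delta_max)       # service, at most delta_max
        a = rng.randint(0, delta_max - 1)    # lighter arrivals keep the queue stable
        if q < delta_max:
            small_count += 1
        slack_sum += mu - a
        q = max(q - mu, 0) + a
    return slack_sum / num_slots, small_count / num_slots

avg_slack, frac_small = simulate_queue(100_000)
print(avg_slack, 2 * frac_small)
```

Service is wasted only in slots where $Q(t) < \mu(t) \le \delta_{\max}$, and the wasted amount is at most $\delta_{\max}$, which is why the average slack is bounded by $\delta_{\max}$ times the fraction of slots with a small queue.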

Theorem 3: [24] Let $(v^*, \nu^*)$ be an optimal solution of
the dual problem. Then, $g(v^*, \nu^*) \ge V U_{tot}(r^*)$. $\Diamond$

Theorem 4: Suppose that the dual function $g(v, \nu)$ is
polyhedral with $\rho = \Theta(1) > 0$, i.e., independent of $V$, and
has a unique optimal $(v^*, \nu^*)$. Then, under the LEM algorithm
without learning and augmenting, there exist $O(1)$ constants
f, K, and D, such that, 

V{D,b)<^e-^\ (47) 

where $\mathcal{P}(D, b)$ is defined as:

$$\mathcal{P}(D, b) \triangleq \limsup_{t\to\infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \Pr\{\|(Q(\tau), E(\tau) - \theta) - (v^*, \nu^*)\| > D + b\}. \tag{48}$$

Proof: (Theorem 4) Similar to the proof of Theorem 1 in
Ea. Omitted for brevity. ■ 

If a steady state distribution exists for $Q(t)$ and $E(t)$, then
$\mathcal{P}(D, b) = \lim_{t\to\infty} \Pr\{\|(Q(t), E(t) - \theta) - (v^*, \nu^*)\| > D + b\}$.
We now present the proof for Theorem 1. We carry out
the proof in the order of Part (c) - Part (a) - Part (b). Our
analysis is very different from those in m and ini, due to 
the energy underflow constraint ([^ and learning. Specifically, 
in im, the energy availability is shown to be never active; in 
our case, we introduce packet dropping to ensure zero energy 
outage. 

Proof: (Theorem 1 - Part (c) Utility) To start, first note
that the dropping mechanism changes the effective energy
queue evolution to:

$$E_n(t+1) = \Big(E_n(t) - \sum_{b \in \mathcal{N}_n^{(o)}} P_{[n,b]}(t)\Big)^+ + e_n(t). \tag{49}$$

It is important to note that (25) still holds under (49). Thus,
by comparing the RHS of (25) and LEM, we get that:

A(t) + A^(f) - UE{ ^ Ulf^RilHt)) I l^(f)} (50) 

n^c 


<B- VE{ Y 

71,C 

C/(^)(i?(^)’“'*(f)) 

y{t)} 



+ Y, (Rn{t) 

-0„)E{ef(f)- 

E ^ 

Tn.V^) 1 

^(f)} 

neAf 





-HYQn^ 

m E 

- E 



71,C 


beAff" 

■) 




^^71 

““(f)] 1 

Yit)} 


Here we have rearranged the terms and “alt” stands for 
“alternative,” i.e., the drift expression under LEM holds when 
we plug any alternative control policy into the RHS. 

By comparing LEM and the dual function $g(v, \nu)$, we see
that the RHS of (50) under LEM indeed equals $B -
g(Q(t), E(t) - \theta)$. Thus, applying Theorem 3, we obtain:

A(f) + A^(f) - UE{ ^ Ulf\R<fl\t)) I y(f)} 

n,c 
$\le B - g(Q(t), E(t) - \theta) \le B - V U_{tot}(r^*).$

Taking expectations over $Y(t)$ and summing the above over
$t = 0, \ldots, T - 1$, we have:

T-l 

E{L(T)-L(0)} + ^E{A^(i)} 

T-l 

<TB- TVUUr*)- 

t—0 n,c 

Rearranging terms, using the fact that $L(t) \ge 0$ and $L(0) = 0$,
and dividing both sides by $VT$, we get:

t—O n,c 

T—1 

t^O 

Using Jensen’s inequality and taking a limit as $T \to \infty$,

T—1 

> UUr-) -T ^ E E{ 

n,c t—0 

To prove ( [T4ll, it remains to show that (i) the last term, 
limT->oo = 0[l/V), and (ii) the frac¬ 

tion of time dropping happens is 0 ( 1 / 1 ^). 


We start with (i). Recall that A^(<) is defined as follows; 

y] (51) 

"■ bGAA,'°> 

- E ^^M^0-R^\t)]\Y{t)] 
-E{ E I 

Notice that we only need to consider Tl < t < T, since = 
= 0 before time $T_L$. Using Lemma 2, we see that with
probability $1 - O(1/V^{4\log(V)})$, (29) holds at time $t = T_L$. This
implies that when V is large, with probability l — 

(52) 


2 log(U)2, V n,c 




(53) 


1 . 


<< + 0„--Ui-Slog(U)2, Vn. 


Combined with (47) and the definitions of $Q(t)$ and $E(t)$,
they imply that, with probability $1 - O(1/V^{4\log(V)})$, in steady
state:

Pr{QW(f) < log(U)2} < iog(v)"-i?) 

Pi{En{t) < log(U)2} < 

Thus, for a sufficiently large V such that log(U)^ > 

D + Umax + ftmax + log(U)2/Ar, One has; 

Pr{Qi^\t) < (54) 

Pr{£;„(f)<Pmax}<e'^-‘°®^''^ (55) 


Using Lemma 3, we conclude then:


f)GAA( 


(o) 


(c) 


m- H 


(c) 

^^[a,n 


p)-Ri^\t) = 

faGA/T"* 

^ ^ -P[ra, 6 ] (f) ~ e„(f) = 
beAri°'’ 


0(L). 


Together with (52 1 and (53 1 , and that + 6) = 


0 (Ulog(U)), the above shows that with probability 1 — 
O( y 4 iog(v-) i the last term limT^oo = 

0 (1/U). 

It remains to show that dropping happens rarely. We first 


use (55 1 to conclude that the fraction of time the energy queue 
does not have enough power after t = Tl is 0(1/U^). To also 
see that no energy outage occurs before T^, note that before 
Tl, we have Qn\t) = 0(U‘^). This is so since a queue can 
increase by at most 7 , = + dmaxMmax = ©( 1 ) in every 

time slot. Thus, for any link, we have < /3V. Since 

$\theta_n = V\log(V)$ for all $n$, we see that if $E_n(t) < P_{\max}$, then
$E_n(t) - \theta_n < -K\beta V$. Now, suppose at time $t$ the optimal
power allocation $P$ has a nonzero component $P_{[n,b]}$ for some
node $n$ with $E_n(t) < P_{\max}$. Then, we can construct a vector $\tilde{P}$
by setting only $P_{[n,b]}$ to zero. Doing so, we get:

G{P) - GiP) 

= E E - L[n,b]{S{t),P)]W[n,b]{t) 

"■ bGW,i°' 

+ (E'n(i) ~ On)Pln,b] 

< {tJ-ln,b]iRiOT) - Oln,b]{S{t),P))W[n,b]{t) 

+ {Enit) — On)P[n,b] 


< l3VKP[n,b] - K/3UP[„7] = 0 . 

Here the last two inequalities follow from the properties of 
This contradicts the fact that $P$ is the optimal vector
at time $t$. Thus, no energy outage will occur before time $T_L$.
This establishes (ii).

Now note that packet dropping happens only when En (f) < 
Pmax- Thus, using the results from above, we conclude that the 
average packet dropping rate is also 0{1/V'^). Since the utility 
functions have maximum derivative f3 = 0(1), this implies 
that the utility loss due to packet dropping is 0 ( 1 /U^), and 
completes the proof for utility. 

(Part (a) - Queue) We now prove the queueing results. Using 
0 , and ( [53] l, we see that in steady state, 

P^{Q^n\i) > log(U)2 + DLh]< (56) 

Pr{i?„(t) > log(U)2 + D + h]< (57) 

Note that these are exactly (32) and (33). From these results,
we see that (30) and (31) follow.

(Part (b) - Delay) To prove Part (b), we note that LEM is
indeed a Lyapunov drift-based algorithm with the LIFO queue
discipline (learning happens only at $t = T_L$). Thus, its delay
performance follows from Theorem 4 in [25]. ■


Appendix C - Proof of Theorem 2
Here we prove Theorem 2.
Proof: (Theorem 2) First, using Lemma 2, we see that at
time $t = T_L = V^c$, with probability $1 - O(1/V^{4\log(V)})$, we have:

$$\|(v^*(t), \nu^*(t)) - (v^*, \nu^* + \theta)\| \le \Theta(V^{1-c/2}\log(V)). \tag{58}$$

Using the definitions of $Q(t)$ and $E(t)$, this implies that:

$$\|(Q(t), E(t)) - (v^*, \nu^* + \theta)\| \le \Theta(V^{1-c/2}\log(V)^2). \tag{59}$$
Then, note that for all $t \ge T_L$, LEM is the same as a pure
drift-based control policy. Hence, using Lemma 2 in [22], we
have that whenever $\|(Q(t), E(t)) - (v^*, \nu^* + \theta)\| \ge D$, where
D is defined in Theorem 

$$E\{\|(Q(t+1), E(t+1)) - (v^*, \nu^* + \theta)\| \mid Q(t), E(t)\} \le \|(Q(t), E(t)) - (v^*, \nu^* + \theta)\| - \epsilon_0, \tag{60}$$

for some eg = ©(I)- Thus, applying Theorem 4 in ifTSl . 
we get that the expected time it takes for $(Q(t), E(t))$ to
get to within distance $D$ of $(v^*, \nu^* + \theta)$ is no more than
$O((V^{1-c/2}\log(V)^2 - D)/\epsilon_0)$.

Combining this with the fact that the learning time is $T_L = V^c$,
we have:

$$E\{T_D\} = O((V^{1-c/2}\log(V)^2 - D)/\epsilon_0 + V^c). \tag{61}$$
This completes the proof of Theorem 2. ■
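
The drift step of the proof can be illustrated with a toy process. The sketch below assumes a scalar distance that moves by $\pm 1$ per slot with an expected decrease of $\epsilon_0$ per slot; this is a hypothetical stand-in for the distance $\|(Q(t), E(t)) - (v^*, \nu^* + \theta)\|$, not the actual LEM dynamics. It shows the expected hitting time scaling as the initial distance divided by $\epsilon_0$.

```python
import random

def hitting_time(start, target, p_down, rng):
    """Slots until a +/-1 walk started at `start` first reaches `target`."""
    distance, slots = start, 0
    while distance > target:
        distance += -1 if rng.random() < p_down else 1
        slots += 1
    return slots

def mean_hitting_time(start=200, target=0, p_down=0.75, runs=200, seed=0):
    """Average hitting time over independent runs with a fixed seed."""
    rng = random.Random(seed)
    total = sum(hitting_time(start, target, p_down, rng) for _ in range(runs))
    return total / runs

eps0 = 2 * 0.75 - 1           # expected decrease per slot (0.5 in this toy setup)
predicted = (200 - 0) / eps0  # drift-based estimate: initial distance / eps0

print(mean_hitting_time(), predicted)
```

In the theorem, the initial distance at time $T_L$ is of order $V^{1-c/2}\log(V)^2$, so the same reasoning gives the first term of (61), and the learning time $V^c$ gives the second.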



